Abstract


Background: Tandem mass spectrometry (MS/MS) is a key technique for peptide identification. MS/MS-based
peptide identification approaches fall into two families, namely de novo and database search.
Both families benefit from accurate prediction of the theoretical spectrum. A theoretical
spectrum consists of the m/z values and intensities of the ions that may occur, estimated by simulating the
spectrum-generating process. Theoretical spectrum prediction has been studied extensively; however, existing
prediction methods suffer from low accuracy because they oversimplify the spectrum simulation process.


Results: In this study, we present an open-source software package, called OpenMS-Simulator, that predicts the
theoretical spectrum for a given peptide sequence. Based on the mobile-proton hypothesis for peptide fragmentation,
OpenMS-Simulator trains a closed-form model for the intensity ratio of adjacent y-ions, from which the whole
theoretical spectrum can be constructed. On a collection of representative spectra datasets with annotated peptide
sequences, experimental results suggest that OpenMS-Simulator predicts theoretical spectra with considerable
accuracy. The study also presents an application of OpenMS-Simulator: the similarity between theoretical spectra and
query spectra can be used to re-rank the peptide sequences reported by SEQUEST/X!Tandem.


Conclusions: OpenMS-Simulator implements a novel model to predict the theoretical spectrum for a given peptide
sequence. Compared with existing theoretical spectrum prediction tools, such as MassAnalyzer and MSSimulator, our
method not only simplifies the computation but also improves the prediction accuracy.

Currently, OpenMS-Simulator supports the prediction of CID and HCD spectra for doubly charged peptides.
Extending it to cover more fragmentation models and peptides of other charge states remains
future work.

Background

Tandem mass spectrometry (MS/MS) is considered an
indispensable technique for high-throughput peptide
identification and characterization in proteomics [1].
Peptide identification has been studied extensively, and a
collection of software packages has been developed, such as
SEQUEST [2], MASCOT [3], X!Tandem [4], SCOPE [5], pFind [6],
and PEAKS DB [7].

MS/MS-based peptide identification approaches can be
categorized into two families. (1) Database search
approaches: for each peptide sequence in a database, the
corresponding theoretical spectrum is predicted and
compared against the query experimental spectrum. The most
similar peptide-spectrum match (PSM) is reported as the
final identification result. (2) De novo identification
approaches: unlike the database search strategy, the de novo
approach does not require a peptide sequence database as
input. In essence, the de novo approach can be treated as a
search over a virtual peptide sequence database consisting
of all possible peptide sequences with the same precursor
mass as the query experimental spectrum.

Accurate prediction of the theoretical spectrum, including
the m/z values and intensities of the ions that may occur,
is important to both database search and de novo
identification approaches. Although theoretically possible,
accurate prediction remains a challenge because the complex
physicochemical process of peptide fragmentation during an
MS/MS experiment is not yet well understood. Therefore, most
existing peptide identification tools employ an
over-simplified model to simulate peptide fragmentation,
leading to inaccurate estimates of ion intensities. Taking
SEQUEST as an example, all y-ions are given a fixed
intensity, regardless of factors with substantial effects on
peptide fragmentation, such as amino acid type and
fragmentation site.
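As a rough illustration of the over-simplified model described above, the sketch below computes singly charged y-ion m/z values from standard monoisotopic residue masses and gives every ion the same intensity. The function names and the unit intensity are illustrative assumptions; this is not code from SEQUEST or OpenMS-Simulator.

```python
# Standard monoisotopic residue masses in Da, rounded to 5 decimals.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O in Da
PROTON = 1.00728   # mass of one proton in Da


def y_ion_mz(peptide):
    """Return singly charged y-ion m/z values y_1 .. y_(n-1).

    y_k contains the last k residues plus one water and one proton.
    """
    mzs = []
    suffix_mass = 0.0
    # Walk from the C-terminus; skip the first residue so that the
    # full-length ion (the precursor) is not reported.
    for residue in reversed(peptide[1:]):
        suffix_mass += RESIDUE_MASS[residue]
        mzs.append(suffix_mass + WATER + PROTON)
    return mzs


def fixed_intensity_spectrum(peptide, intensity=1.0):
    """Hypothetical helper: every y-ion gets the same intensity."""
    return [(mz, intensity) for mz in y_ion_mz(peptide)]


if __name__ == "__main__":
    for mz, inten in fixed_intensity_spectrum("ETELEDPLENMGAQMVK")[:3]:
        print(f"{mz:.4f}\t{inten}")
```

A sequence-aware predictor replaces the constant intensity with a value that depends on the residues around each cleavage site.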

The relationship between peptide sequences and ion
intensities has been studied to improve the accuracy of
theoretical spectrum prediction [8-12]. A pioneering work
among these is the kinetic model used in MassAnalyzer, which
simulates the peptide fragmentation pathways based on the
“mobile proton” hypothesis. Another prediction method,
MSSimulator, employs support vector regression to predict
the likelihood that an ion appears in a spectrum [13].

Based on the “mobile proton” peptide fragmentation model,
we have proposed a novel theoretical spectrum prediction
approach called MS-Simulator [14]. Unlike existing
approaches, which predict ion intensities directly,
MS-Simulator predicts the intensity ratio of every two
adjacent y-ions. In brief, the intensity of a y-ion is
determined by both neighbouring amino acids and remote
amino acids. The remote amino acids, however, were observed
to have approximately equal effects on the intensities of
ions $y_i$ and $y_{i+1}$, and thus cancel out when the
intensity ratio $y_i / y_{i+1}$ is calculated. In fact, only
the two termini of the peptide were employed in MS-Simulator
to capture the effects of remote amino acids. Once the
intensity ratios of all neighbouring ions are known, the
whole spectrum can be easily constructed. It should be
pointed out that, unlike the kinetic model used in
MassAnalyzer [15,16], the intensity ratio used by
MS-Simulator has a closed form; thus, the computation is
significantly simplified and the prediction accuracy is also
considerably improved.
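The step from adjacent ratios to a whole spectrum can be sketched as follows. This is a minimal illustration rather than the MS-Simulator implementation: the function name is hypothetical, and scaling the most intense ion to 1 is an assumption, since ratios alone fix intensities only up to a common factor.

```python
import math


def intensities_from_log_ratios(log_ratios):
    """Rebuild relative y-ion intensities from ln(y_i / y_{i+1}).

    log_ratios[k] holds ln(y_{k+1} / y_{k+2}) for the 1-based ions
    y_1, y_2, ..., so m ratios describe m + 1 ions.
    """
    # Anchor the last ion at log-intensity 0 and accumulate backwards:
    # ln(y_i) = ln(y_{i+1}) + ln(y_i / y_{i+1}).
    log_int = [0.0] * (len(log_ratios) + 1)
    for k in range(len(log_ratios) - 1, -1, -1):
        log_int[k] = log_int[k + 1] + log_ratios[k]
    # Assumed normalization: the most intense ion is scaled to 1.
    peak = max(log_int)
    return [math.exp(value - peak) for value in log_int]


if __name__ == "__main__":
    # Hypothetical ratios: y1/y2 = 2 and y2/y3 = 0.5.
    print(intensities_from_log_ratios([math.log(2.0), math.log(0.5)]))
```

Working in log space keeps the accumulation additive, which matches a model that predicts the logarithm of the ratio.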

This study presents an open-source implementation of
MS-Simulator, called OpenMS-Simulator, which can be freely
downloaded from our website.


Implementation and results
The OpenMS-Simulator package has four functionalities,
namely theoretical spectrum prediction, PSM re-ranking,
FDR analysis, and spectrum visualization. These
functionalities are briefly described as follows:


1. Theoretical spectrum prediction and 
spectrum visualization. 
OpenMS-Simulator takes peptide sequences as input 
and reports predicted theoretical spectra as output. A 
theoretical spectrum consists of y-ions and the 
corresponding isotopic derivatives. The current 
version of OpenMS-Simulator supports predicting 
theoretical spectra of both HCD (Higher-energy 
collisional dissociation) and CID (Collision-induced 
dissociation) types. 

OpenMS-Simulator visualizes spectra by labelling all
peaks with ion types. In addition, the theoretical
spectrum and its experimental counterpart are shown in
one frame to clearly display their similarities and
differences. The Pearson correlation coefficient
(Pearson CC) is also calculated as a quantitative
measure of the similarity (see Figure 1 for an example).

2. PSM re-ranking and FDR analysis.
OpenMS-Simulator can also be used to re-rank the
PSMs reported by SEQUEST or X!Tandem. More
specifically, SEQUEST usually reports a
peptide-spectrum match together with two scores,
namely Xcorr and ΔCn, to measure the likelihood
that the query spectrum is generated from the
peptide. OpenMS-Simulator combines the two
scores with the Pearson CC to yield a new score, i.e.
$Xcorr + 5 \times \Delta Cn + 5 \times CC$. For a PSM
reported by X!Tandem, OpenMS-Simulator uses the score
$\#SharedPeaks \times \sqrt{S_T} \times CC + S_T$, where
#SharedPeaks denotes the number of peaks shared by the
experimental and predicted spectra, and $S_T$ refers to
the score reported by X!Tandem. The new score is employed
to re-rank the PSMs reported by SEQUEST/X!Tandem.
OpenMS-Simulator also provides FDR (False Discovery
Rate) analysis to evaluate the performance of PSM
re-ranking. In particular, two FDR curves are drawn:
one is calculated from the original ranks given by
SEQUEST/X!Tandem, and the other from the new score
calculated by OpenMS-Simulator. This way, the
improvement due to the re-ranking strategy can be
shown directly. The FDR was estimated using the decoy
count method; that is:


$$\mathrm{FDR} = \frac{FP}{TP + FP}$$

where FP denotes the number of false-positive
peptide identifications, and TP denotes the number
of true-positive identifications.
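The SEQUEST re-ranking score and the FDR estimate described in item 2 can be sketched as below. This is an illustration under stated assumptions, not the OpenMS-Simulator code: the function names are hypothetical, decoy hits are counted as false positives and target hits as true positives, and the accepted set is taken at the lowest score cut-off that still satisfies the FDR limit.

```python
def combined_sequest_score(xcorr, delta_cn, cc):
    """Combined score from the text: Xcorr + 5 * dCn + 5 * CC."""
    return xcorr + 5.0 * delta_cn + 5.0 * cc


def fdr(tp, fp):
    """FDR = FP / (TP + FP); defined as 0.0 when nothing is accepted."""
    total = tp + fp
    return fp / total if total else 0.0


def count_accepted(psms, max_fdr):
    """Count target PSMs accepted at a given FDR limit.

    psms is a list of (score, is_decoy) pairs. PSMs are ranked by
    decreasing score, and the deepest rank whose running FDR stays
    within max_fdr determines how many targets are accepted.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = best = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1   # assumed to estimate false positives
        else:
            targets += 1  # assumed to be true positives
        if fdr(targets, decoys) <= max_fdr:
            best = targets
    return best


if __name__ == "__main__":
    # Hypothetical PSMs: (combined score, is_decoy).
    psms = [(3.0, False), (2.5, False), (2.0, True), (1.5, False)]
    print(count_accepted(psms, max_fdr=0.3))
```

Running `count_accepted` once on the original search-engine scores and once on the combined scores gives the two points that would be compared at each FDR level.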

The theoretical spectrum prediction performance of
OpenMS-Simulator is evaluated on the SwedCAD 7 T_LTQ-FT
dataset (downloaded from
http://www.bmms.uu.se/CAD/download.html). The dataset
consists of 15,897 unmodified, doubly charged CID spectra
together with highly confident peptide sequence
annotations. On this dataset, the average Pearson CC between
the experimental spectra and the theoretical spectra
predicted by OpenMS-Simulator is as high as 0.890. In
Figure 1, the theoretical spectrum predicted for peptide
ETELEDPLENMGAQMVK is shown as an example.

We also evaluated OpenMS-Simulator by comparing it with
two other theoretical spectrum prediction models, namely
MSSimulator, which uses support vector regression, and
MassAnalyzer, which uses a kinetic model (see Table 1). To
make a fair comparison, we performed the evaluation on the
dataset used by MSSimulator [13], which contains 15,324
doubly charged ion trap mass spectra.


Table 1 Comparison of three theoretical spectrum
prediction models

Similarity    MSSimulator    MassAnalyzer    OpenMS-Simulator
Mean          0.864          0.896           0.926
Variance      0.088          0.102           0.006

Dataset: 15,324 doubly charged spectra with peptide sequence annotations
used by MSSimulator [13].


The prediction accuracy is measured by the similarity
between the experimental spectrum and the theoretical
prediction. Specifically, we used the following similarity
measure suggested by MassAnalyzer [16].


$$\mathrm{similarity} = \frac{\sum_m \sqrt{I^1_m \, I^2_m}}{\sqrt{\sum_m I^1_m \, \sum_m I^2_m}}$$


where $I^1_m$ and $I^2_m$ denote the intensities of the ions with
m/z of m in the corresponding spectra.
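A minimal sketch of this similarity measure is given below. It assumes that each spectrum is a dictionary from an m/z key to an intensity and that peaks in the two spectra are matched by identical keys (for example after binning); the function name and this matching rule are illustrative assumptions, and real data would need an m/z tolerance.

```python
import math


def sqrt_similarity(spec_a, spec_b):
    """Sum of sqrt(Ia * Ib) over matched peaks, divided by
    sqrt(total intensity of a * total intensity of b).

    Peaks present in only one spectrum contribute zero to the
    numerator but still count in the denominator.
    """
    shared = set(spec_a) & set(spec_b)
    numerator = sum(math.sqrt(spec_a[m] * spec_b[m]) for m in shared)
    denominator = math.sqrt(sum(spec_a.values()) * sum(spec_b.values()))
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical binned spectra: {m/z bin: intensity}.
    experimental = {147: 4.0}
    theoretical = {147: 1.0, 276: 3.0}
    print(sqrt_similarity(experimental, theoretical))
```

The measure equals 1 for identical spectra and 0 for spectra with no peak in common.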

The re-ranking efficiency is evaluated on datasets
PAe000350, PAe000351, and PAe003641 (downloaded from
http://www.peptideatlas.org/repository). For each spectrum
in PAe000350 or PAe000351, SEQUEST was executed to generate
the most likely peptide sequence, and for spectra in
PAe003641, X!Tandem was executed to give the peptide
identification results. Subsequently, the PSMs were
re-ranked by running OpenMS-Simulator. FDR analysis
suggests that the re-ranking strategy substantially
increases the number of correctly identified PSMs (see
Figures 2, 3 and 4). In particular, at an FDR of 0.005 on
dataset PAe000350, SEQUEST correctly identifies 7,983 PSMs,
while OpenMS-Simulator correctly identifies 12,001 PSMs
(see Tables 2 and 3).




Figure 2 FDR curves of OpenMS-Simulator and SEQUEST on dataset PAe000350.

Figure 3 FDR curves of OpenMS-Simulator and SEQUEST on dataset PAe000351.

Figure 4 FDR curves of OpenMS-Simulator and X!Tandem on dataset PAe003641.


Model and parameters
Though peptide fragmentation has been studied intensively,
it is still not fully understood how a peptide fragments
during mass spectrometry. To date, the “mobile proton”
hypothesis is one of the most widely accepted explanations
of the peptide fragmentation process, which consists of a
collection of main peptide fragmentation pathways.

Based on the “mobile proton” hypothesis, OpenMS-Simulator
employs a statistical model to predict the intensities of
possible ions, which extends our previous work MS-Simulator
with several extensions and modifications. To avoid
repetition, only the extensions and modifications are
listed below:

1. The previous version of MS-Simulator supports
prediction of theoretical CID spectra only.
OpenMS-Simulator adds support for prediction of HCD
spectra.

2. Compared with the previous version of
MS-Simulator, more fragmentation pathways are
taken into consideration in OpenMS-Simulator.
Specifically, besides the common $b_x$-$y_z$
fragmentation pathway, the diketopiperazine
pathway [8,17] was also incorporated in the model,
enabling an accurate intensity prediction for the
$y_{n-1}$ ion. The probabilities of the two pathways are
denoted as $F(A_i)$ and $D(A_i, i)$, respectively. Thus,
Eq. 1 in the MS-Simulator model was improved to be:

$$\ln\frac{y_i}{y_{i+1}} = \beta \times (E_i - E_{i+1}) + \ln\bigl(F(A_i) + D(A_i, i)\bigr) - \ln\bigl(F(A_{i+1}) + D(A_{i+1}, i+1)\bigr)$$


3. Unlike MS-Simulator, which uses 5 consecutive
neighbouring amino acids around the concerned $y_i$
ion, OpenMS-Simulator uses only 4 neighbouring amino
acids to build the prediction model. This way, the
number of parameters is reduced with little
influence on the prediction accuracy. The model
parameters A(x, d) used in OpenMS-Simulator are
summarized as follows:


(1) −2 ≤ d ≤ 1 for x being any amino acid
except for LYS or ARG.

(2) −8 ≤ d ≤ 5 for x being LYS or ARG.

(3) 0 ≤ d ≤ 10 for x being Cterm.

(4) The effect of Nterm is divided into 5 levels
according to d, i.e.,
$A(\mathrm{Nterm}, d) \approx A'(\mathrm{Nterm}, s)$, where

$$s = \frac{5 \times d}{n}$$


The estimated parameters can be found in Additional
file 1.


Table 2 The number of correctly identified PSMs by
OpenMS-Simulator and SEQUEST on dataset PAe000350
and PAe000351

             FDR = 0.005                    FDR = 0.01
Dataset      SEQUEST   OpenMS-Simulator     SEQUEST   OpenMS-Simulator
PAe000350    7,983     12,001               9,070     14,012
PAe000351    5,257     7,935                5,617     8,551


Table 3 The number of correctly identified PSMs by
OpenMS-Simulator and X!Tandem on dataset PAe003641

             FDR = 0.005                    FDR = 0.01
Dataset      X!Tandem  OpenMS-Simulator     X!Tandem  OpenMS-Simulator
PAe003641    1,872     2,721                1,985     2,993


Conclusions

We present an open-source package, OpenMS-Simulator,
implemented in Java. OpenMS-Simulator can be used to
accurately predict the theoretical spectrum for a given
peptide sequence. To demonstrate the performance of
theoretical spectrum prediction, OpenMS-Simulator provides
a functionality to re-rank PSMs reported by SEQUEST or
X!Tandem. Experimental results suggest that the predicted
theoretical spectra help improve peptide identification.


Availability and requirements

Project name: OpenMS-Simulator.

Project home page: http://www.bioinfo.org.cn/OpenMS-Simulator.

Operating system: Platform independent. 
Programming language: Java. 

Other requirements: Java 1.6 or higher. 

License: GNU GPL FreeBSD. 

Any restrictions to use by non-academics: licence 
needed. 


Additional file 


Additional file 1: Parameters estimation and long tables. 